Take Home Assignment - Origin Financial

Overview

This report aims to show all processes developed to analyze the data challenge proposed by the Origin Financial team. It’s organized into four main sessions:

The first session will be Engineering Raw Data where I’ll describe all the steps needed to clean raw data. The following step, Exploratory Data Analysis has some descriptive analysis about the data. At this point, I’ll guide my analysis to data clustering based on RFM (Recency Frequency and Monetary). After that, I’ll show the clustering analysis at session Clustering and some insights based on it. Finally, I have some notes and future ideas further some conclusions about the challenge.

Engineering Raw Data

Before starting the data analysis, let’s do a quick review of available data sets and raise the need for pre-engineering.

Customers

## Rows: 803
## Columns: 7
## $ id            <chr> "131d8363-e2a6-4c60-acef-da03e6bfbdc4", "831216da-744b-4…
## $ created_at    <chr> "2020-03-11 19:38:35", "2020-05-01 13:09:00", "2020-03-1…
## $ date_of_birth <chr> "1990-08-16", "1933-11-22", "1986-02-02", "1990-01-01", …
## $ gender        <chr> "male", "", "", "male", "", "", "", "male", "male", "mal…
## $ country       <chr> "US", "US", "US", "US", "US", "US", "US", "US", "US", "U…
## $ state         <chr> "CA", "WA", "NY", "CA", "", "CA", "", "CA", "NY", "CA", …
## $ city          <chr> "San Francisco", "Tonasket", "New York", "San Francisco"…

In a quick overview, it’s possible to note that variables created_at and date_of_birth should be modified to timestamp and date, respectively. Also, we can note that all missing data are represented by empty character space and it will be replaced by NA.

The variables state and city have some strange characters like this: <img src='#' onerror=alert('xss') /> I presume that this behavior is caused by a css operator and it’ll be replaced by NA.

The variable id it’s lowercase, while in the transactions data it’s upper-cased, so it’ll be replaced by uppercase.

Transactions

## Rows: 11,059
## Columns: 17
## $ X_id                                     <chr> "4D2119A1-03D9-48F2-99B7-FEDD…
## $ user_id                                  <chr> "94102846-0B6D-45D8-AFD1-DA80…
## $ account_id                               <chr> "88F8D694-AFF0-4CFC-BBB6-C89D…
## $ account_name                             <chr> "Chase - Plaid Money Market (…
## $ description                              <chr> "ACH Electronic CreditGUSTO P…
## $ type                                     <chr> "expense", "expense", "expens…
## $ amount                                   <dbl> -5850, -5850, -5850, -5850, -…
## $ date                                     <chr> "2021-06-08T00:00:00.000Z", "…
## $ extra_fields.category.0                  <chr> "Transfer", "Transfer", "Tran…
## $ extra_fields.category.1                  <chr> "Debit", "Debit", "Debit", "D…
## $ extra_fields.category.2                  <chr> "", "", "", "", "", "", "", "…
## $ extra_fields.category_id                 <int> 21006000, 21006000, 21006000,…
## $ extra_fields.merchant_name               <chr> "", "", "", "", "", "", "", "…
## $ extra_fields.name                        <chr> "ACH Electronic CreditGUSTO P…
## $ extra_fields.payment_channel             <chr> "other", "other", "other", "o…
## $ extra_fields.payment_meta.payment_method <chr> "ACH", "ACH", "ACH", "ACH", "…
## $ created_at                               <chr> "2021-06-18T19:14:32.075Z", "…

The transaction data doesn’t have so much engineer to do. It’s just changed timestamp and date types and replace empty spaces with NA. I also, replaced . with _ in some variable names.

Exploratory Data Analysis

The exploratory data analysis it’s divides into univariate analysis before RFM transformation and more insightful after that.

Data summary
Name db_users
Number of rows 803
Number of columns 8
_______________________
Column type frequency:
character 6
Date 1
numeric 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
id 0 1.00 36 36 0 803 0
created_at 0 1.00 2414 5633 0 803 0
gender 751 0.06 4 10 0 4 0
country 0 1.00 2 2 0 1 0
state 179 0.78 2 2 0 27 0
city 176 0.78 4 21 0 120 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
date_of_birth 173 0.78 1000-01-01 2001-01-01 1989-12-12 119

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
age_at 173 0.78 33.56 40.21 19.41 29.63 31.08 33.99 1020.33 ▇▁▁▁▁

At this simple description of the data set, it’s possible to note that age has an outlier that will be replaced by NA. Also, the great number of missing data in the variable gender, turning it almost useless in the data analysis.

Univariate Analysis

Age

The customer’s age distribution is concentrated between the range 30 - 50 years as shown below:

Maybe this variable could be a good one to use in the clustering process.

State

The geographical distribution of customers is dense in the California State (31%) followed by MD(11.1%) add NC/NY tied with (7.8%). It’s important to note that more than 22% of the customers have no state information.

Users Transactions

Looking at the user transactions, I note a tiny number of different customers, just 84. Some of them (almost 20%) have more than one account, but as we expected to analyze customer behavior over account behavior I’ll treat them as the same customer.

Number of Accounts Number of Customers % of Customers
1 70 83.33
2 7 8.33
3 1 1.19
4 2 2.38
5 2 2.38
6 1 1.19
7 1 1.19

At this point it’s really important to say that just 84 customers to do the cluster analysis could be a problem for a bunch of reasons as like:

  • Some algorithms won’t provide robust results under small samples

  • The cluster interpretation could be a little messy.

  • The pattern recognition it’s hard to do

Said that I’ll develop my analysis and won’t concern about it.

Type of transacions

Analyzing the type of transactions we can note that has a mix of three different transactions. Income, Expense, and Transfer, as I’ll show in the next steps, will be treated separately to do the clustering analysis.

The amount of money transactions types have a strange behavior having a low variety of values. Transfer for example has just the value 25. It’s something that could affect the behavior of clustering analysis.

Looking at the number and amount of transactions per day by expenses, transfer and income it’s possible to note that the days 2021-06-15 (income and expense) and 2021-06-16 (income) has a lot more transactions compared to other days. The expense category it’s almost 10x the median for the period and the income it’s more than 10x.

I don’t know if this strange behavior comes from a selection bias or if this is from the data. I’ll presume that this behavior is a normal one.

Transactions Category

Looking at the transactions category we can note that we have a lot of missing ones, but part of that is caused by income transactions.

extra_fields_category_0 extra_fields_category_1 extra_fields_category_2 extra_fields_category_id n perc
NA NA NA NA 2742 24.79
Travel Taxi NA 22016000 1230 11.12
Food and Drink Restaurants Fast Food 13005032 1057 9.56
Travel Airlines and Aviation Services NA 22001000 1055 9.54
Food and Drink Restaurants NA 13005000 1053 9.52
Payment Credit Card NA 16001000 623 5.63
Food and Drink Restaurants Coffee Shop 13005043 614 5.55
Transfer Credit NA 21005000 607 5.49
Payment NA NA 16000000 443 4.01
Recreation Gyms and Fitness Centers NA 17018000 443 4.01
Shops Sporting Goods NA 19046000 443 4.01
Transfer Debit NA 21006000 425 3.84
Transfer Deposit NA 21007000 324 2.93

Almost 20% of transactions came from the travel category, divided into taxi (Uber) and Airline and Aviation.

Food and drink represent 25% divided in restaurants like coffee shops and fast foods.

Also has some transactions represented by recreation and shops and some transfer between accounts

Merchant Name

Looking deeper at the expense types and merchant names, we can see at the top Uber, McDonald’s, Starbucks, and so on.

Channel Transactions

The preferential channel to expenses is in-store (73%). Would be interesting if we had some information about online channels as app payment, website shop, etc.

Payment Methods

By payment methods we have just 4% of transactions with this data, it’s almost useless to the analysis.

RFM Transformation

After this exploratory analysis about variables and looking to a solution to the clustering problem. I decided to transform the transactional data into recency frequency and monetary value for the three types of transactions: Expense, Income, and Transfer.

After this transformation our dataset looks like this:

## Rows: 84
## Columns: 33
## $ id                                      <chr> "AD7226ED-2D26-45FD-AB37-C3823…
## $ state                                   <chr> "NY", "CA", "IL", "NY", NA, "V…
## $ age_at                                  <dbl> 34.10458, 32.52759, 33.34352, …
## $ count_transactions_expense              <dbl> 121, 120, 144, 538, 270, 0, 0,…
## $ amount_expense                          <dbl> 2825.37, 15654.90, 18785.88, 2…
## $ amount_mean_expense                     <dbl> 23.35017, 130.45750, 130.45750…
## $ count_transactions_income               <dbl> 24, 210, 252, 50, 24, 6, 6, 6,…
## $ amount_income                           <dbl> 12000.00, 585798.60, 702958.32…
## $ amount_mean_income                      <dbl> 5.000000e+02, 2.789517e+03, 2.…
## $ count_transactions_transfer             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ amount_transfer                         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ amount_mean_transfer                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ n_transactions_transfer                 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ n_transactions_food_and_drink           <dbl> 72, 0, 0, 246, 122, 0, 0, 0, 0…
## $ n_transactions_payment                  <dbl> 0, 0, 0, 48, 25, 0, 0, 0, 0, 0…
## $ n_transactions_recreation               <dbl> 0, 0, 0, 50, 24, 0, 0, 0, 0, 0…
## $ n_transactions_shops                    <dbl> 0, 0, 0, 48, 25, 0, 0, 0, 0, 0…
## $ n_transactions_travel                   <dbl> 49, 0, 0, 146, 74, 0, 0, 0, 0,…
## $ transactions_amount_transfer            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ transactions_amount_food_and_drink      <dbl> 2537.52, 0.00, 0.00, 53286.50,…
## $ transactions_amount_payment             <dbl> 0.0, 0.0, 0.0, 99768.0, 51962.…
## $ transactions_amount_recreation          <dbl> 0, 0, 0, 3925, 1884, 0, 0, 0, …
## $ transactions_amount_shops               <dbl> 0, 0, 0, 24000, 12500, 0, 0, 0…
## $ transactions_amount_travel              <dbl> 287.85, 0.00, 0.00, 24573.84, …
## $ transactions_amount_mean_transfer       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ transactions_amount_mean_food_and_drink <dbl> 35.24333, 0.00000, 0.00000, 21…
## $ transactions_amount_mean_payment        <dbl> 0.0, 0.0, 0.0, 2078.5, 2078.5,…
## $ transactions_amount_mean_recreation     <dbl> 0.0, 0.0, 0.0, 78.5, 78.5, 0.0…
## $ transactions_amount_mean_shops          <dbl> 0, 0, 0, 500, 500, 0, 0, 0, 0,…
## $ transactions_amount_mean_travel         <dbl> 5.87449, 0.00000, 0.00000, 168…
## $ no_channel                              <dbl> 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, …
## $ other                                   <dbl> 0.1655172, 0.0000000, 0.000000…
## $ in_store                                <dbl> 0.8344828, 0.0000000, 0.000000…

Just 84 data points (customers) and 33 variables. I drop the recency variables because I think that isn’t the goal of the analysis to explore “how much time the customer doesn’t have any transaction”.

Expense vs Income

Firstly, analyzing the behavior of the average amount of transactions in expenses and income we can note some behaviors.

Note that some customer has high average income and low average expense, while we have others that expend more than earn. This is a good spoiler about the clustering.

It’s important to note the outlier that expends much more than earn, I will keep this customer in analysis because it’s a different behavior that could be more representative if we have more data.

Expense Vs Income Vs Age

Adding the variable age to our analysis, it’s possible to note that it doesn’t affect the behavior of income and expense, so this variable wouldn’t be useful in the cluster analysis.

Expense vs Income vs Transfer

Transfer average amount it’s another variable that doesn’t affect the income and expense. As shown below:

Relationship between categorys of expenses

Looking at the correlation between category expenses it’s possible to note that every transaction has a high correlation. This analysis isn’t so conclusive because we have just 13 customers with expenses categories.

Also, I will try to do a cluster with these data points, the main idea is to show my way of thinking about the problem then try to interpret something. Because just 13 customers isn’t enough data for any analysis.

Clustering

At the clustering analysis, my goal is to make two types of clusters, one based on incomes and expenses and another one based on the categories of expense. I think this way could give us some conclusions about the data.

It’s important to say that I use just one method of clustering called kmeans, it’s based on centroid distances. This choice was made because the data sets have few data points and I would like to maintain the analysis as simple as possible.

At the beginning of every clustering, I will analyze to determine the optimal number of clusters, after that the cluster analysis will be made, and also insights about the clusters’ responses.

To determine de optimal number of clusters I run tree methods:

  • Elbow method - The idea behind this method is to compute the within-cluster variation to different number of clusters and minimize it.

  • Silhouette Method - It measures the quality of clustering based on the average similarity of each point to its own cluster compared to other clusters.

  • GAP method - The gap statistic compares the total intracluster variation for different values of k with their expected values under null reference distribution of the data (i.e. a distribution with no obvious clustering).

Type Transactions

Expense vs Income

We can visualize the results of methods to determine the optimal number of clusters. Two of them indicates the best number of cluster should be 6 and the elbow method indicates 4 clusters.

Using 4 clusters we can have some insights about the customers:

  • 1 - Customers with no income and no expense

  • 2 - Customers with medium income and medium expense

  • 3 - Customers with high income and low expense

  • 4 - Customers with no income and high expense

Type Clustering Cluster Number Customers
cluster_4 1 1
cluster_4 2 9
cluster_4 3 5
cluster_4 4 69
cluster_6 1 4
cluster_6 2 5
cluster_6 3 1
cluster_6 4 66
cluster_6 5 5
cluster_6 6 3

Note that using 6 clusters we have few customers in some clusters and it can cause the suggestion of the methods because the distance between members of cluster has more chance to be tiny.

Expense vs Income vs Freq Expense

Adding the frequency of expenses we can visualize better the clusters.

The methods indicate 5 or 6 clusters, I used 5 in the analysis:

Note the interpretation about the clusters:

  • 1 - Customers with medium income, lot of expense transactions, and medium amount average expense;

  • 2 - Customers with no amount expense;

  • 3 - Customers with medium income, medium transactions expense, and medium amount expense;

  • 4 - Outliers, the customer that has high amount expense in few transactions and has no income;

  • 5 - Customer with high income and low average amount expense with medium transactions.

At this point, we have some good interpretations about users transactions. Let’s see about expenses categories.

Category Expense

We have just 18 customers with expenses, so it’s hard to extract some helpful interpretation or relationship in the data. But let’s look to clustering methods.

At this point, I use the relative amount mean expend to the total amount. The same analogy to the frequency.

Amount Type Expense

The methods indicate just 2 clusters relatives to the amount mean expense in each category.

Cluster Transfer Food and Drink Payment Recreation Shops Travel
1 0.09 0.30 0.18 0.01 0.04 0.07
2 0.69 0.08 0.15 0.01 0.04 0.04

The only difference between groups looks that Group 2 expends more on transfers and group one on food and drink. But nothing too conclusive.

Frequency Type Expense

The methods indicate just 2 clusters relatives to the amount mean expense in each category.

Cluster Transfer Food and Drink% Payment Recreation Shops Travel
1 0.09 0.31 0.04 0.04 0.04 0.20
2 0.15 0.37 0.07 0.07 0.07 0.22

There aren’t any conclusions about the groups, the expenses are really similar.

Conclusion

Our goal with this analysis was to find some group of customers that has similar behavior in the transaction data. I led the analysis using the approach o frequency and monetary value expenses, incomes, and transfers. Initially, we had 803 customers to cluster after they join between data sets left just 84 customers with the transaction.

Based on it, I achieve some good results in clustering customers using income and expenses. A couple of behaviors were detected. Unfortunately using the expenses data there was no conclusion about the groups, it’s an important stand out that we had just 18 customers, something that makes it hard to reach closure.

Notes

This take-home assignment was very challenging because it’s an open problem and I could have a lot of different ways to do the analysis, My choice was to explore the RFM customer behavior and in the end just FM.

I choose to explore just one method of clustering to make things simple. But in a workday problem, I would test some other methods like hierarchical clustering and principally density-based clustering like Dbscan and Hdbscan, because this method has the advantage of not classifying some data points with the idea of “Not all customers belongs a cluster”.

I would like to thank the Origin team for dedicate time to elaborate on this problem and give me the chance to show my work.